Heatmap showing distances between documents
The heatmap plots the distances between the documents using the normalised Euclidean distance metric. A blue-pink gradient encodes the correlation between the documents, where pink indicates high correlation and blue indicates low correlation.
Clustered Heatmap showing distances between documents
The clustered heatmap also presents the distances between the documents using the normalised Euclidean distance metric; it is simply another way to visualize them. As before, a blue-pink gradient encodes the correlation between the documents, where pink indicates high correlation and blue indicates low correlation.
Dendrogram showing hierarchical clustering of documents
The dendrogram presents the hierarchical clustering of documents using the Ward method. The number of clusters was determined by cutting the dendrogram with a vertical dotted line at the value 1.0, resulting in 4 main clusters (C1, C2, C3 and C4) and 11 sub-clusters. C1 was related to C2, and C3 was related to C4. C1 consisted of documents 1 and 2 as one sub-cluster and documents 3 and 4 as another. Similarly, C2 consisted of documents 8 and 9 as its first sub-cluster, documents 5 and 6 as its second, and document 7 as its third. C3 consisted of documents 10 and 11 as one sub-cluster and documents 12 and 13 as another. In contrast, C4 consisted of document 19 as its first sub-cluster, documents 17 and 18 as its second, document 16 as its third, and documents 14 and 15 as its fourth.
The libraries and the dataset required to perform document clustering in R were loaded. Then a document-term matrix (DTM) was created, followed by a term frequency-inverse document frequency (TF-IDF) matrix. The number of clusters was determined using the elbow method, where the bend or knee in the plot is taken as an indicator of the optimal number of clusters for the data; any additional clusters beyond the bend add little or no value. It can be observed from the figure below that the bend formed at 5, so 5 clusters were used to fit the data. The figure also shows that as the number of clusters (k) increases, the variance decreases.
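The same pipeline can be sketched in Python (the chapter uses R); the corpus below is a hypothetical toy example, and scikit-learn stands in for the R packages: build a TF-IDF-weighted document-term matrix, then trace within-cluster variance (inertia) against k for the elbow method.

```python
# Minimal sketch of the DTM -> TF-IDF -> elbow-method workflow.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical toy corpus (not the chapter's dataset).
docs = [
    "library services user survey",
    "library user satisfaction survey",
    "malaria parasite genome sequencing",
    "parasite drug resistance genome",
    "digital repository metadata standards",
    "metadata standards for digital libraries",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # TF-IDF-weighted DTM

# Inertia (total within-cluster variance) falls as k grows; the bend
# in this curve suggests the optimal number of clusters.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0)
          .fit(tfidf.toarray()).inertia_
    for k in range(1, 6)
}
for k, v in inertias.items():
    print(k, round(v, 3))
```

Plotting `inertias` against k reproduces the elbow plot described above.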
The Euclidean distance method was used to determine the distance between the documents; this distance is used by the hierarchical clustering algorithm to cluster them. The following figure represents the distance matrix of the dataset. Red indicates high similarity and blue indicates low similarity, with the color level proportional to the similarity value: pure red represents 0 and pure blue represents 1. As can be observed from the figure, documents 2, 19, 13, and 8 were very dissimilar to all the other documents and even to each other (colored blue) and can thus be treated as separate clusters, whereas the remaining documents show some similarity to each other and can be clustered together in one group.
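The distance matrix behind such a figure can be computed in a few lines; this is a Python sketch (not the chapter's R code) over a hypothetical four-document TF-IDF matrix, with the distances normalised to [0, 1] as in the heatmaps above.

```python
# Sketch: pairwise Euclidean distance matrix for the heatmap.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical TF-IDF vectors for four documents (3 terms each).
X = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9],
              [0.1, 0.0, 0.9]])

D = squareform(pdist(X, metric="euclidean"))
D = D / D.max()   # normalise to [0, 1]: 0 = identical, 1 = most dissimilar
print(np.round(D, 2))
```

Documents 1 and 2 (and likewise 3 and 4) come out close together, which is exactly the structure a heatmap of `D` would show.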
Another way to visualize the distances between the documents is hierarchical clustering with dendrograms. A dendrogram is a tree diagram that shows the hierarchical relationship between the data points (in our case, different documents). Hierarchical clustering was performed using Ward's method, and the tree was cut into 5 clusters at distance 70. The x-axis of the figure indicates the document's serial number, whereas the y-axis shows the distance or dissimilarity between the clusters calculated by the Euclidean method. Distance is inversely proportional to similarity: the smaller the distance between two documents, the more similar they are. Documents close to each other have small dissimilarity and were linked together. It can be observed from the figure that five clusters, viz. C1, C2, C3, C4 and C5, were identified, where the distance between documents in C1 kept increasing with the level of merger. Fifteen documents were clustered together in one cluster (C1) based on their similarity, whereas each of the remaining four clusters consisted of just one document, which means the dataset contained more documents on one subject/topic than on the others.
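The clustering-and-cutting step can be sketched with SciPy (the chapter uses R's `hclust`); the 19-document TF-IDF matrix below is random placeholder data, and the tree is cut by requesting 5 clusters rather than at a specific height.

```python
# Sketch: Ward hierarchical clustering and cutting the tree into 5 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((19, 10))   # hypothetical TF-IDF matrix: 19 docs x 10 terms

Z = linkage(X, method="ward")                    # Euclidean + Ward linkage
labels = fcluster(Z, t=5, criterion="maxclust")  # cut into 5 clusters
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree itself, with merge heights on the y-axis as described above.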
The dendrogram in hierarchical clustering can be visualized with a rectangular, circular or phylogenetic structure. These are different ways of viewing the same results, to be chosen according to your research problem and dataset.
Timeline showing the core topics in DESIDOC Journal of Library and Information Technology from 1981 to 2018 (©2019 Springer Nature, all rights reserved; reprinted with permission from Springer Nature, published in Lamba and Madhusudhan)
Eight time-slices were finalized based on the volume distribution of DJLIT articles. For each time-slice, a different number of topics was chosen to fit the distribution of research articles. Fifty core topics were identified that fitted the corpus of DJLIT research articles, of which only 29 were unique.
Latent Dirichlet Allocation Topic and Word Result for PQDT Global ETDs during 2014-2018 (n=441)
The results from the RapidMiner platform were analysed to assign appropriate topics to the corpus of ETDs. The table sums up the LDA results for the study, showing the labelling of the topics (a through h) in descending order of their probability values, with Topic-a having the highest probability value. To determine the topics, both co-occurring words and representative ETDs were taken into account. Representative ETDs are the top five ETDs ranked by the highest topic-proportion value for a chosen topic. Eight core topics were identified, where Number of articles = 441; Number of Words = 5; AlphaSum = 1.874; Beta = 0.06.
The libraries and the dataset required to perform topic modeling in R were loaded. The loaded data was then cleaned to remove stopwords, punctuation, numbers, and whitespace. Structural topic modeling (STM) was initialized with 5 topics (k). The figure shows the first way of representing the document-topic proportion of the corpus that belongs to each topic, and lists the top words with the highest probability associated with the selected number of topics.
The figure shows the second way of representing the document-topic proportion of the corpus that belongs to each topic, and lists the top words with the highest probability associated with the selected number of topics.
The figure shows the third way of representing the document-topic proportion of the corpus that belongs to each topic, and lists the top words with the highest probability associated with the selected number of topics.
The figure shows the fourth way of representing the document-topic proportion of the corpus that belongs to each topic, and lists the top words with the highest probability associated with the selected number of topics.
In the following code, n represents the number of representative documents and topics indicates the topic number. The code was run for each of the 5 topics (where topics = 1, 2, 3, 4, 5) with a constant value of n (where n = 5). Table 2.3 presents the top five representative ETDs for the modeled topics, ranked according to their probability.
It can be observed that both Topic-5 and Topic-4 are more common in the corpus compared to the other topics (Fig. 2.4). The high-probability keywords in Fig. 2.4 and the representative ETDs in Table 2.3 indicate that Topic-1 focuses on evaluation, specifically of resources, websites, and content; Topic-2 emphasises library services; Topic-3 represents information management; Topic-4 is about information technology; and Topic-5 is on user studies.
The correlation between the topics was identified using a network graph. It was found that Topic-1 (evaluation) was related to Topic-2 (library services), and Topic-3 (information management) was related to Topic-4 (information technology). Topic-5 (user studies) was isolated and not related to the other topics.
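Such a topic-correlation graph can be sketched by correlating the columns of the doc-topic matrix and linking topics above a threshold (the chapter's R code uses stm's topicCorr; the matrix and threshold below are hypothetical). The toy data is built so that, using 0-based indices, topics 0 and 1 move together and topics 2 and 3 move together, while topic 4 stays isolated, mirroring the finding above.

```python
# Sketch: topic-correlation network from document-topic proportions.
import numpy as np

# Hypothetical doc-topic proportions (6 docs x 5 topics); columns 0/1
# and 2/3 are deliberately identical so those pairs correlate strongly.
theta = np.array([
    [0.40, 0.40, 0.05, 0.05, 0.10],
    [0.35, 0.35, 0.10, 0.10, 0.30],
    [0.05, 0.05, 0.40, 0.40, 0.10],
    [0.10, 0.10, 0.35, 0.35, 0.30],
    [0.20, 0.20, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.20, 0.20, 0.30],
])

corr = np.corrcoef(theta.T)   # topic-topic correlation matrix
edges = [(i, j)
         for i in range(corr.shape[0])
         for j in range(i + 1, corr.shape[0])
         if corr[i, j] > 0.5]  # hypothetical threshold for drawing an edge
print(edges)
```

Drawing `edges` as a graph gives the network described above: two connected pairs and one isolated node.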
Word Co-Occurrence Network of Titles on Malaria Disease in Web of Science in 2019
The figure presents the word co-occurrence network for the top 50 words representing the literature on malaria indexed in the Web of Science (WoS) database for the year 2019. Three main communities (clusters) were identified, represented by green, red and blue in the figure.
Cluster 1 (red) consisted of 17 words (nodes) about the parasites "plasmodium falciparum" and "plasmodium vivax" that cause malaria, with an emphasis on the parasite's genetics, molecular features, response rate, and detection rate. The nodes "plasmodium" and "falciparum" had the highest scores for all the centrality measures in cluster 1, followed by the nodes infection, blood, vivax, and human.
Cluster 2 (blue) consisted of 20 words (nodes) about "malaria", with a focus on its risks, treatment, transmission, prevalence, clinical trials and factors, Africa, Uganda, and children. In cluster 2, the node "malaria" had the highest value for all the centrality measures, followed by study, children, associated, analysis, treatment, and transmission.
Cluster 3 (green) consisted of 13 words (nodes) about "anopheles", the genus of mosquito that transmits malaria to humans, with an emphasis on the gambiae species, vector, evaluation, drug resistance, and activity. The node "anopheles" had the highest value for all the centrality measures in cluster 3, followed by resistance, antimalaria, evaluation, drug, vector, and species.
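The construction behind such a network can be sketched without any graph library: words appearing in the same title are linked, and a simple degree centrality (number of distinct neighbours) identifies hub words. The study used several centrality measures; degree is just the simplest, and the titles below are hypothetical.

```python
# Sketch: word co-occurrence network and degree centrality from titles.
from itertools import combinations
from collections import Counter

# Hypothetical toy titles (not the WoS dataset).
titles = [
    "malaria transmission children uganda",
    "malaria treatment children africa",
    "plasmodium falciparum drug resistance",
    "anopheles gambiae vector resistance",
    "malaria prevalence children",
    "malaria vector control",
]

edges = Counter()
for title in titles:
    words = sorted(set(title.split()))
    for a, b in combinations(words, 2):
        edges[(a, b)] += 1          # co-occurrence within one title

# Degree centrality: number of distinct neighbours per word.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(degree.most_common(3))
```

On a real corpus the `edges` weights would feed a community-detection algorithm to recover clusters like the three described above.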
Text Network showing 22 Communities of Latent Topics
The figure represents the 22 clusters/communities of 238 words (nodes) which were determined from the network text analysis of the data.
Screenshot of evaluation result
A prediction model using an SVM classifier was created and evaluated for the study. The true class was compared to the predicted class to determine the evaluation metrics, that is, the kappa, precision, and recall values. With a 99.70% kappa value and recall and precision values above 90% for all eight topics, the suggested model can be considered good. A limitation of this study is that the dataset was not representative of library science ETDs in the PQDT Global database; to obtain more reliable results, more training data is needed to train the model.
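The evaluation step can be sketched in Python with scikit-learn (the study used RapidMiner; the two-class toy corpus below is hypothetical, and for brevity the model is scored on its own training data, whereas a real evaluation would use a held-out split).

```python
# Sketch: SVM text classifier with kappa, precision, and recall.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

# Hypothetical labelled corpus: two clearly separable classes.
docs = [
    "library services user circulation",
    "library user services reference",
    "library circulation services desk",
    "library reference user desk",
    "malaria parasite vector mosquito",
    "malaria vector parasite blood",
    "parasite mosquito malaria blood",
    "vector blood parasite malaria",
]
labels = ["library"] * 4 + ["malaria"] * 4

X = TfidfVectorizer().fit_transform(docs)
clf = LinearSVC().fit(X, labels)
pred = clf.predict(X)   # illustrative only: scoring on training data

kappa = cohen_kappa_score(labels, pred)
prec = precision_score(labels, pred, average="macro")
rec = recall_score(labels, pred, average="macro")
print(kappa, prec, rec)
```

Cohen's kappa corrects accuracy for chance agreement, which is why the study reports it alongside per-topic precision and recall.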